Abstract

This paper introduces the Furthest Hyperplane Problem (FHP), which is an unsupervised
counterpart of Support Vector Machines. Given a set of $n$ points in $\mathbb{R}^d$, the objective is to
produce the hyperplane (passing through the origin) which maximizes the separation margin,
that is, the minimal distance between the hyperplane and any input point.

To the best of our knowledge, this is the first paper achieving provable results regarding
FHP. We provide both lower and upper bounds to this NP-hard problem. First, we give a
simple randomized algorithm whose running time is $n^{O(1/\theta^2)}$, where $\theta$ is the optimal separation
margin. We show that its exponential dependency on $1/\theta^2$ is tight, up to sub-polynomial factors,
assuming SAT cannot be solved in sub-exponential time. Next, we give an efficient approximation
algorithm. For any $\alpha \in [0, 1]$, the algorithm produces a hyperplane whose distance from at
least a $1 - 5\alpha$ fraction of the points is at least $\alpha$ times the optimal separation margin. Finally,
we show that FHP does not admit a PTAS by presenting a gap preserving reduction from a
particular version of the PCP theorem.
1 Introduction

One of the most well known and studied objective functions in machine learning for obtaining linear 
classifiers is the Support Vector Machines (SVM) objective. SVMs are extremely well studied,
both in theory and in practice. We refer the reader to [?], [?] and [?] for a thorough survey
and references therein. The simplest possible setup is the separable case: given a set of $n$ points
$\{x^{(i)}\}_{i=1}^n$ in $\mathbb{R}^d$ and labels $y_1, \ldots, y_n \in \{1,-1\}$, find hyperplane parameters $w \in \mathbb{S}^{d-1}$ and $b \in \mathbb{R}$
which maximize $\theta'$ subject to $(\langle w, x^{(i)}\rangle + b)\, y_i \ge \theta'$. The intuition is that different concepts will be
"well separated" from each other and that the best decision boundary is the one that maximizes 
the separation. This intuition is supported by extensive research which is beyond the scope of 
this paper. Algorithmically, the optimal solution for this problem can be obtained using Quadratic 
Programming or the Ellipsoid Method in polynomial time. In cases where the problem has no feasible
solution the constraints must be made "soft" and the optimization problem becomes significantly 
harder. This discussion, however, also goes beyond the scope of this paper. 

As a whole, SVMs fall under the category of supervised learning although semi-supervised and
unsupervised versions have also been considered (see references below). We note that to the best
of our knowledge the papers dealing with the unsupervised scenario were purely experimental and
did not contain any rigorous proofs. In this model, the objective remains unchanged but some (or
possibly all) of the point labels are unknown. The maximization thus ranges not only over the
parameters $w$ and $b$ but also over the possible labels $y_i \in \{1,-1\}$ for the unlabeled points. The
integer constraints on the values of $y_i$ make this problem significantly harder than SVMs.

In [?] Xu et al. coin the name Maximal Margin Clustering (MMC) for the case where none of the
labels are known. Indeed, in this setting the learning procedure behaves very much like clustering.
The objective is to assign the points to two groups (indicated by $y_i$) such that solving the labeled
SVM problem according to this assignment produces the maximal margin.¹ In [?] Bennett and
Demiriz propose to solve the resulting mixed integer quadratic program directly using general
solvers and give some encouraging experimental results. De Bie et al. [?] and Xu et al. [?] suggest an
SDP relaxation approach and show that it works well in practice. Joachims in [?] suggests a local
search approach which iteratively improves on a current best solution. While the above algorithms
produce good results in practice, their analysis does not guarantee the optimality of the solution.
Moreover, the authors of these papers state their belief that the non-convexity of this problem
makes it hard but, to the best of our knowledge, no proof of this was given.

FHP is very similar to unsupervised SVM or Maximum Margin Clustering. The only difference
is that the solution hyperplane is constrained to pass through the origin. More formally, given $n$
points $\{x^{(i)}\}_{i=1}^n$ in a $d$-dimensional Euclidean space, FHP is defined as follows:

$$
\begin{aligned}
\text{Maximize}\quad & \theta' \\
\text{s.t.}\quad & \|w\|_2 = 1 \\
& \forall\, 1 \le i \le n:\quad |\langle w, x^{(i)}\rangle| \ge \theta'
\end{aligned}
\tag{1}
$$

The labels in this formulation are given by $y_i = \mathrm{sign}(\langle w, x^{(i)}\rangle)$, which can be viewed as the
"side" of the hyperplane to which $x^{(i)}$ belongs. At first glance, MMC appears to be harder than
FHP since it optimizes over a larger set of possible solutions, namely those for which $b$ (the
hyperplane offset) is not necessarily zero. We claim however that any MMC problem can be solved
using at most $\binom{n}{2}$ invocations of FHP. The observation is that any optimal solution for MMC must
have two equally distant points on opposite sides of the hyperplane. Therefore, there always are at
least two points $i$ and $j$ such that $\langle w, x^{(i)}\rangle + b = -(\langle w, x^{(j)}\rangle + b)$. This means that the optimal
hyperplane obtained by MMC must pass through the point $(x^{(i)} + x^{(j)})/2$. Therefore, solving FHP
centered at $(x^{(i)} + x^{(j)})/2$ will yield the same hyperplane as MMC, and iterating over all pairs of
points concludes the observation. From this point on we explore FHP exclusively but the reader
should keep in mind that any algorithmic claim made for FHP holds also for MMC due to the
above.

¹ The assignment is required to label at least one point to each cluster to avoid a trivial unbounded margin.
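The pairwise reduction from MMC to FHP can be illustrated with a minimal sketch. It is not taken from the paper: `toy_fhp_2d` is a hypothetical stand-in solver that only scans directions in the plane, and `mmc_via_fhp` is an illustrative name. The sketch translates the points so that each candidate midpoint becomes the origin, calls the FHP solver, and recovers the offset $b$ from the translation.

```python
import math
from itertools import combinations


def toy_fhp_2d(points, steps=3600):
    """Hypothetical stand-in FHP solver: scan unit normals in the plane."""
    best_w, best_margin = None, -1.0
    for k in range(steps):
        angle = math.pi * k / steps  # w and -w define the same hyperplane
        w = (math.cos(angle), math.sin(angle))
        margin = min(abs(w[0] * x + w[1] * y) for x, y in points)
        if margin > best_margin:
            best_w, best_margin = w, margin
    return best_w, best_margin


def mmc_via_fhp(points, solve_fhp=toy_fhp_2d):
    """Solve MMC by calling an FHP solver once per pair of points."""
    best = None
    for p, q in combinations(points, 2):
        mid = tuple((a + b) / 2 for a, b in zip(p, q))
        # Translate so that the candidate midpoint becomes the origin.
        shifted = [tuple(a - m for a, m in zip(x, mid)) for x in points]
        w, margin = solve_fhp(shifted)
        if best is None or margin > best[2]:
            # <w, x - mid> = <w, x> + b with b = -<w, mid>.
            b = -sum(wi * mi for wi, mi in zip(w, mid))
            best = (w, b, margin)
    return best


# Two vertical pairs of points, three units apart.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
w, b, margin = mmc_via_fhp(points)
labels = [1 if w[0] * x + w[1] * y + b > 0 else -1 for x, y in points]
```

Any exact or approximate FHP routine could be passed as `solve_fhp`; the angle scan is only there to make the example self-contained.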

1.1 Results and techniques 

In Section 3 we begin by describing three exact (yet exponential) algorithms for FHP. These turn
out to be preferable to one another for different problem parameters. The first is a brute force
search through all feasible labelings which runs in time $n^{O(d)}$. The second looks for a solution by
enumerating over an $\epsilon$-net of the $d$-dimensional unit sphere and requires $(1/\theta)^{O(d)}$ operations. The
last generates solutions created by random unit vectors and can be shown to find the right solution
after $n^{O(1/\theta^2)}$ tries (w.h.p.). While algorithmically the random hyperplane algorithm is the simplest,
its analysis is the most complex. Assuming a large constant margin, which is not unrealistic in
machine learning applications, it provides the first polynomial time solution to FHP. Unfortunately,
due to the hardness result below, its exponential dependency on $\theta$ cannot be improved.

In Section 4 we show that if one is allowed to discard a small fraction of the points then much
better results can be obtained. We note that from the perspective of machine learning, a hyperplane
that separates almost all of the points still provides a meaningful result (see the discussion at the
end of Section 4). We give an efficient algorithm which finds a hyperplane whose distance from
at least a $1 - 5\alpha$ fraction of the points is at least $\alpha\theta$, where $\alpha \in [0, 1]$ is any constant and $\theta$ is the
optimal margin of the original problem. The main idea is to first find a small set of solutions which
perform well 'on average'. These solutions are the singular vectors of row-reweighted versions of a
matrix containing the input points. We then randomly combine those into a single solution.



In Section 5 we prove that FHP is NP-hard to approximate to within a small multiplicative
constant factor, ruling out a PTAS. We present a two-step gap preserving reduction from MAX-3SAT
using a particular version of the PCP theorem [?]. It shows that the problem is hard even
when the number of points is linear in the dimension and when all the points have approximately
the same norm. As a corollary of the hardness result we get that the running time of our exact
solution algorithm is, in a sense, optimal. There cannot be an algorithm solving FHP in time
$n^{O(1/\theta^{2-\epsilon})}$ for any constant $\epsilon > 0$, unless SAT admits a sub-exponential time algorithm.

2 Preliminaries and notations 

The set $\{x^{(i)}\}_{i=1}^n$ of input points for FHP is assumed to lie in a Euclidean space $\mathbb{R}^d$, endowed
with the standard inner product denoted by $\langle\cdot,\cdot\rangle$. Unless stated otherwise, we denote by $\|\cdot\|$ the
$\ell_2$ norm. Throughout the paper we let $\theta$ denote the solution of the optimization problem defined
in Equation (1). The parameter $\theta$ is also referred to as "the margin" of $\{x^{(i)}\}_{i=1}^n$ or simply "the
margin" when it is obvious to which set of points it refers. Unless stated otherwise, we consider
only hyperplanes which pass through the origin. They are defined by their normal vector $w$ and
include all points $x$ for which $\langle w, x\rangle = 0$. By a slight abuse of notation, we usually refer to a
hyperplane by its defining normal vector $w$. Due to the scaling invariance of this problem we
assume w.l.o.g. that $\|x^{(i)}\| \le 1$. One convenient consequence of this assumption is that $\theta \le 1$.
We denote by $\mathcal{N}(\mu, \sigma)$ the Normal distribution with mean $\mu$ and standard deviation $\sigma$.
Unless stated otherwise, $\log(\cdot)$ functions are base 2.
Definition 2.1 (Labeling, feasible labeling). We refer to any assignment of $y_1, \ldots, y_n \in \{1,-1\}$
as a labeling. We say that a labeling is feasible if there exists $w \in \mathbb{S}^{d-1}$ such that $\forall i\;\, y_i \langle w, x^{(i)}\rangle > 0$.
Conversely, for any hyperplane $w \in \mathbb{S}^{d-1}$ we define its labeling as $y_i = \mathrm{sign}(\langle w, x^{(i)}\rangle)$.

Definition 2.2 (Labeling margin). The margin of a feasible labeling is the margin obtained by
solving SVM on $\{x^{(i)}\}_{i=1}^n$ using the corresponding labels but constraining the hyperplane to pass
through the origin. This problem is polynomial time solvable by Quadratic Programming or by the
Ellipsoid Method [?]. We say a feasible labeling is optimal if it obtains the maximal margin.

Definition 2.3 (Expander Graphs). An undirected graph $G = (V, E)$ is called an $(n, d, r)$-expander
if $|V| = n$, the degree of each node is $d$, and its edge expansion $h(G) = \min_{|S| \le n/2} |E(S, S^c)|/|S|$ is
at least $r$. By Cheeger's inequality [?], $h(G) \ge (d - \lambda)/2$, where $\lambda$ is the second largest eigenvalue,
in absolute value, of the adjacency matrix of $G$. For every $d = p + 1 \ge 14$, where $p$ is a prime
congruent to 1 modulo 4, there are explicit constructions of $(n, d, r)$-expanders with $r > d/5$ for
infinitely many $n$. This is due to the fact that these graphs exhibit $\lambda \le 2\sqrt{d-1}$ (see [?]), and
hence by the above $h(G) \ge (d - 2\sqrt{d-1})/2 > d/5$ (say) for $d \ge 14$. Expander graphs will play a
central role in the construction of our hardness result in Section 5.
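The numeric claim at the end of Definition 2.3 is easy to check directly. The snippet below is an illustrative check, not part of the paper; `cheeger_lower_bound` is a name chosen here. It evaluates the bound $(d - 2\sqrt{d-1})/2$ and compares it with $d/5$ for a few admissible degrees.

```python
import math


def cheeger_lower_bound(d):
    """Lower bound (d - 2*sqrt(d - 1)) / 2 on h(G) when lambda <= 2*sqrt(d - 1)."""
    return (d - 2 * math.sqrt(d - 1)) / 2


# d = p + 1 for the primes p = 13, 17, 29, each congruent to 1 modulo 4.
for d in (14, 18, 30):
    print(d, round(cheeger_lower_bound(d), 3), d / 5)
```

Squaring the inequality shows that it is equivalent to $9d^2 - 100d + 100 > 0$, which holds for every $d > 10$, so $d \ge 14$ leaves some slack.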

3 Exact algorithms 

3.1 Enumeration of feasible labelings 

The most straightforward algorithm for this problem enumerates over all feasible labelings of the
points and outputs the one maximizing the margin. Note that there are at most $n^{d+1}$ different
feasible labelings to consider. This is due to Sauer's Lemma [12] and the fact that the VC dimension
of hyperplanes in $\mathbb{R}^d$ is $d + 1$.² This enumeration can be achieved by a Breadth First Search
(BFS) on the graph $G(Y, E)$ of feasible labelings. Every node in the graph $G$ is a feasible labeling
($|Y| \le n^{d+1}$) and two nodes are connected by an edge iff their corresponding labelings differ by
at most one point label. Thus, the maximal degree in the graph is $n$ and the number of edges
in this graph is at most $|E| \le |Y| n \le n^{d+2}$. Moreover, computing for each node its neighbors
list can be done efficiently since we only need to check the feasibility (linear separability) of at
most $n$ labelings. Performing BFS thus requires at most $O(|Y|\,\mathrm{poly}(n, d) + |E| \log(|E|)) = n^{O(d)}$ operations.
The only non-trivial observation is that the graph $G$ is connected. To see this, consider the path
from a labeling $y$ to a labeling $y'$. This path exists since it is achieved by rotating a hyperplane
corresponding to $y$ to one corresponding to $y'$. By an infinitesimal perturbation on the point set
(which does not affect any feasible labeling) we get that this rotation encounters only one point at
a time and constitutes a path in $G$. To conclude, there is a simple enumeration procedure for all
$n^{d+1}$ linearly separable labelings which runs in time $n^{O(d)}$.

3.2 An $\epsilon$-net algorithm

The second approach is to search through a large enough set of hyperplanes and measure the
margins produced by the labelings they induce. Note that it is enough to find one hyperplane which
obtains the same labels as the optimal-margin one does. This is because having the labels suffices for
solving the labeled problem and obtaining the optimal hyperplane. We observe that the correct
labeling is obtained by any hyperplane $w$ whose distance from the optimal one is $\|w - w^*\| < \theta$.
To see this, let $y^*$ denote the correct optimal labeling; then
$y_i^* \langle w, x^{(i)}\rangle = \langle w^*, y_i^* x^{(i)}\rangle + \langle w - w^*, y_i^* x^{(i)}\rangle \ge \theta - \|w - w^*\| \cdot \|x^{(i)}\| > 0$.
It is therefore enough to consider hyperplane normals $w$ which belong
to an $\epsilon$-net on the sphere $\mathbb{S}^{d-1}$ with $\epsilon < \theta$. Deterministic constructions of such nets exist with size
$(1/\theta)^{O(d)}$ [?]. Enumerating all the points on the net produces an algorithm which runs in time
$O((1/\theta)^{O(d)}\,\mathrm{poly}(n,d))$.³

² Sauer's Lemma [12] states that the number of possible feasible labelings of $n$ data points by a classifier with VC
dimension $d_{VC}$ is bounded by $n^{d_{VC}}$.

3.3 Random Hyperplane Algorithm 

Both algorithms above are exponential in the dimension, even when the margin $\theta$ is large. A first
attempt at taking advantage of the large margin uses dimension reduction. An easy corollary of the
well known Johnson-Lindenstrauss lemma yields that randomly projecting the data points into
dimension $O(\log(n)/\theta^2)$ preserves the margin up to a constant. Then, applying the $\epsilon$-net algorithm
on the reduced space requires only $n^{O(\log(1/\theta)/\theta^2)}$ operations. Similar ideas were introduced in [?]
and subsequently used in [15, 16, 17]. It turns out, however, that a simpler approach improves on
this. Namely, pick $n^{O(1/\theta^2)}$ unit vectors $w$ uniformly at random from the unit sphere. Output the
labeling induced by one of those vectors which maximizes the margin. To establish the correctness
of this algorithm it suffices to show that a random hyperplane induces the optimal labeling with a
large enough probability.

Lemma 3.1. Let $w^*$ and $y^*$ denote the optimal solution of margin $\theta$ and the labeling it induces.
Let $y$ be the labeling induced by a random hyperplane $w$. The probability that $y = y^*$ is at least
$n^{-O(1/\theta^2)}$.

The proof of the lemma is somewhat technical and deferred to Appendix A. The assertion of the
lemma may seem surprising at first. The measure of the spherical cap of vectors $w$ whose distance
from $w^*$ is at most $\theta$ is only $\approx \theta^d$. Thus, the probability that a random $w$ falls in this spherical
cap is very small. However, we show that it suffices for $w$ to merely have a weak correlation with
$w^*$ in order to guarantee that (with large enough probability) it induces the optimal labeling.

Given Lemma 3.1, the Random Hyperplane Algorithm is straightforward. It randomly samples
$n^{O(1/\theta^2)}$ unit vectors, computes their induced labelings, and outputs the labeling (or hyperplane) which admits
the largest margin. If the margin $\theta$ is not known, we use a standard doubling argument to enumerate
it. The algorithm solves FHP w.h.p. in time $n^{O(1/\theta^2)}$.
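The sampling loop can be sketched in a few lines. This is a simplified illustration rather than the algorithm as analyzed: it scores each sampled normal $w$ directly by $\min_i |\langle w, x^{(i)}\rangle|$ instead of solving the labeled problem for the labeling that $w$ induces, so the value it reports is only a lower bound on the margin of that labeling. The function names, the `trials` parameter and the fixed seed are assumptions made for the example.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniform direction on the unit sphere via normalized Gaussians."""
    while True:
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        if norm > 0.0:
            return [v / norm for v in g]


def margin_of(w, points):
    """Smallest distance |<w, x>| from the hyperplane with unit normal w."""
    return min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in points)


def random_hyperplane_fhp(points, trials, seed=0):
    """Keep the sampled unit normal with the largest margin.

    Simplification: the sampled w is scored directly, instead of solving
    the labeled problem for the labeling that w induces.
    """
    rng = random.Random(seed)
    d = len(points[0])
    best_w, best_margin = None, -1.0
    for _ in range(trials):
        w = random_unit_vector(d, rng)
        m = margin_of(w, points)
        if m > best_margin:
            best_w, best_margin = w, m
    return best_w, best_margin


points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-0.6, 0.0, 0.8), (0.0, -0.8, -0.6)]
w, margin = random_hyperplane_fhp(points, trials=2000)
```

In the setting of Lemma 3.1 the number of trials would be $n^{O(1/\theta^2)}$; the constant 2000 above is arbitrary.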

Tightness of Our Result. A corollary of our hardness result (Theorem 5.4) is that, unless
SAT has sub-exponential time algorithms, there exists no algorithm for FHP whose running time is
$n^{O(1/\theta^{2-\zeta})}$ for any $\zeta > 0$. Thus, the exponential dependency of the Random Hyperplane Algorithm
on $\theta$ is optimal. This is since the hard FHP instance produced by the reduction in Theorem 5.4
from SAT has $n$ points in $\mathbb{R}^d$ with $d = O(n)$ where the optimal margin is $\theta = \Omega(1/\sqrt{d})$. Thus, if
there exists an algorithm which solves FHP in time $n^{O(1/\theta^{2-\zeta})}$, it can be used to solve SAT in time
$2^{O(n^{1-\zeta/2} \log(n))} = 2^{o(n)}$.

4 Approximation algorithm 

In this section we present an algorithm which approximates the optimal margin if one is allowed
to discard a small fraction of the points. For any $\alpha > 0$ it finds a hyperplane whose distance from
a $(1 - O(\alpha))$-fraction of the points is at least $\alpha$ times the optimal margin $\theta$ of the original problem.



³ This procedure assumes the margin $\theta$ is known. This assumption can be removed by a standard doubling
argument.



Consider first the problem of finding the hyperplane whose average margin is maximal. That
is, $w \in \mathbb{S}^{d-1}$ which maximizes $\mathbb{E}_i \langle w, x^{(i)}\rangle^2$. This problem is easy to solve. The optimal $w$ is the
top right singular vector of the matrix $A$ whose $i$'th row contains $x^{(i)}$. In particular, if we assume the
problem has a separating hyperplane $w^*$ with margin $\theta$, then $\mathbb{E}_i \langle w, x^{(i)}\rangle^2 \ge \mathbb{E}_i \langle w^*, x^{(i)}\rangle^2 \ge \theta^2$.
However, there is no guarantee that the obtained $w$ will give a high value of $|\langle w, x^{(i)}\rangle|$ for all $i$.
It is possible, for example, that $|\langle w, x^{(i)}\rangle| = 1$ for $\theta^2 n$ points and $0$ for all the rest. Our first goal
is to produce a set of vectors $w^{(1)}, \ldots, w^{(t)}$ which are good on average for every point. Namely,
$\forall i\;\, \mathbb{E}_j \langle w^{(j)}, x^{(i)}\rangle^2 = \Omega(\theta^2)$. To achieve this, we adaptively re-weight the points according to their
distance from previous hyperplanes, so that those which have small inner product will have a larger
influence on the average in the next iteration. We then combine the hyperplanes using random
Gaussian weights in order to obtain a single random hyperplane which is good for any individual
point.

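Algorithm 1 Approximate FHP Algorithm

Input: Set of points $\{x^{(i)}\}_{i=1}^n \in \mathbb{R}^d$
Output: $w \in \mathbb{S}^{d-1}$

$\tau_1(i) \leftarrow 1$ for all $i \in [n]$
$j \leftarrow 1$
while $\sum_{i=1}^n \tau_j(i) \ge 1/n$ do
    $A_j \leftarrow n \times d$ matrix whose $i$'th row is $\sqrt{\tau_j(i)} \cdot x^{(i)}$
    $w^{(j)} \leftarrow$ top right singular vector of $A_j$
    $\sigma_j(i) \leftarrow |\langle x^{(i)}, w^{(j)}\rangle|$
    $\tau_{j+1}(i) \leftarrow \tau_j(i)\,(1 - \sigma_j^2(i)/2)$
    $j \leftarrow j + 1$
end while

$w' \leftarrow \sum_{j=1}^{t} g_j \cdot w^{(j)}$ for $g_j \sim \mathcal{N}(0,1)$
return: $w \leftarrow w'/\|w'\|$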
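A minimal sketch of this reweighting loop is given below. It is an illustration under stated assumptions, not the authors' implementation: the top right singular vector is only approximated by power iteration on $A_j^T A_j$ (an exact SVD routine would normally be used), the input points are assumed to satisfy $\|x^{(i)}\| \le 1$, and `max_rounds` is a safety cap added for the example. The weight update $\tau_{j+1}(i) = \tau_j(i)(1 - \sigma_j^2(i)/2)$ is the one used in the proofs of Claims 4.1 and 4.2.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def top_right_singular_vector(rows, rng, iters=300):
    """Approximate top right singular vector of A by power iteration on A^T A."""
    d = len(rows[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(dot(v, v))
    v = [c / norm for c in v]
    for _ in range(iters):
        av = [dot(r, v) for r in rows]                                    # A v
        u = [sum(a * r[k] for a, r in zip(av, rows)) for k in range(d)]  # A^T (A v)
        norm = math.sqrt(dot(u, u))
        if norm == 0.0:
            break
        v = [c / norm for c in u]
    return v


def approximate_fhp(points, seed=0, max_rounds=10_000):
    """Reweighting loop followed by a random Gaussian combination."""
    rng = random.Random(seed)
    n = len(points)
    tau = [1.0] * n                      # tau_1(i) = 1
    sigma_sq = [0.0] * n                 # running sum over j of sigma_j(i)^2
    directions = []
    while sum(tau) >= 1.0 / n and len(directions) < max_rounds:
        rows = [[math.sqrt(t) * c for c in x] for t, x in zip(tau, points)]
        w_j = top_right_singular_vector(rows, rng)
        sig = [abs(dot(x, w_j)) for x in points]
        tau = [t * (1.0 - s * s / 2.0) for t, s in zip(tau, sig)]
        sigma_sq = [a + s * s for a, s in zip(sigma_sq, sig)]
        directions.append(w_j)
    g = [rng.gauss(0.0, 1.0) for _ in directions]
    w = [sum(gj * wj[k] for gj, wj in zip(g, directions))
         for k in range(len(points[0]))]
    norm = math.sqrt(dot(w, w))
    return [c / norm for c in w], sigma_sq, tau


# Five unit vectors in the plane.
points = [(1.0, 0.0), (0.6, 0.8), (-0.8, 0.6), (0.0, -1.0), (0.6, -0.8)]
w, sigma_sq, tau = approximate_fhp(points)
```

The returned `sigma_sq` holds $\sum_j \sigma_j^2(i)$ for each point, which makes it possible to check on a concrete run that every entry is at least $\log_2 n$ once the loop has stopped.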



Claim 4.1. The main loop in Algorithm 1 terminates after at most $t \le 4\log(n)/\theta^2$ iterations.

Proof. Fix some $j$. Define $\tau_j = \sum_{i=1}^n \tau_j(i)$. We know that for some unit vector $w^*$ (the optimal
solution to the FHP) it holds that $|\langle x^{(i)}, w^*\rangle| \ge \theta$ for all $i$. Also since $w^{(j)}$ maximizes the expression
$\|A_j w\|^2$ we have:

$$\sum_i \sigma_j^2(i)\,\tau_j(i) = \|A_j w^{(j)}\|^2 \ge \|A_j w^*\|^2 = \sum_i \tau_j(i) \cdot \langle x^{(i)}, w^*\rangle^2 \ge \tau_j \cdot \theta^2.$$

It follows that $\tau_{j+1} = \tau_j - \sum_i \sigma_j^2(i)\tau_j(i)/2 \le \tau_j(1 - \theta^2/2)$ and the claim follows by elementary
calculations since $\tau_1 = n$. □
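The elementary calculation can be spelled out as follows. This is a sketch that uses only the recursion $\tau_{j+1} \le \tau_j(1 - \theta^2/2)$, the initial value $\tau_1 = n$, and the standard inequality $1 - x \le e^{-x}$:

$$
\tau_{j+1} \;\le\; n\left(1 - \theta^2/2\right)^{j} \;\le\; n\, e^{-j\theta^2/2},
\qquad
n\, e^{-j\theta^2/2} < \frac{1}{n} \iff j > \frac{4\ln(n)}{\theta^2}.
$$

So the stopping condition $\tau_j < 1/n$ is reached once the number of iterations exceeds $4\ln(n)/\theta^2$. Since logarithms in the paper are base 2 and $\log_2(n) \ge \ln(n)$, this is consistent with the stated bound of $4\log(n)/\theta^2$, up to rounding.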



Claim 4.2. Let $\sigma_i = \sqrt{\sum_{j=1}^t \sigma_j^2(i)}$. When Algorithm 1 terminates, for each $i$ it holds that $\sigma_i^2 \ge \log(n)$.

Proof. Fix $i \in [n]$. We know that when the process ends, $\tau_t(i) \le \tau_t < 1/n$. As $\tau_1(i) = 1$ we get that

$$1/n > \tau_t(i) = \tau_1(i) \cdot \prod_{j=1}^{t}\left(1 - \sigma_j^2(i)/2\right) = \prod_{j=1}^{t}\left(1 - \sigma_j^2(i)/2\right) \ge \prod_{j=1}^{t} 2^{-\sigma_j^2(i)} = 2^{-\sigma_i^2}.$$

The last inequality holds since $1 - x/2 \ge 2^{-x}$ for $0 \le x \le 1$. By taking logarithms from both sides
we get that $\sum_j \sigma_j^2(i) \ge \log(n)$ as claimed. □



Lemma 4.3. Let $0 < \alpha < 1$. Algorithm 1 outputs a random $w \in \mathbb{S}^{d-1}$ such that with probability at
least 1/10 at most a $5\alpha$ fraction of the points satisfy $|\langle x^{(i)}, w\rangle| < \alpha\theta$.



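Proof of Lemma 4.3. First, by Markov's inequality and the fact that $\mathbb{E}[\|w'\|^2] = t$ we have that
$\|w'\| \le 2\sqrt{t}$ w.p. at least 3/4. We assume this to be the case from this point on. Now we bound
the probability that the algorithm 'fails' for point $i$, that is, that $|\langle x^{(i)}, w'\rangle| < 2\alpha\theta\sqrt{t}$.
The inner product $\langle x^{(i)}, w'\rangle = \sum_j g_j \langle x^{(i)}, w^{(j)}\rangle$ is distributed as $\mathcal{N}(0, \sigma_i)$, whose density is
at most $1/(\sqrt{2\pi}\,\sigma_i)$. By Claim 4.1 and Claim 4.2 we have $2\alpha\theta\sqrt{t} \le 4\alpha\sqrt{\log(n)} \le 4\alpha\sigma_i$, so point $i$
fails with probability less than $8\alpha/\sqrt{2\pi}$.
Since the expected number of failed points is less than $8\alpha n/\sqrt{2\pi}$ we have, using Markov's inequality
again, that the probability that the number of failed points is more than $5\alpha n$ is at most 0.65. We
also might fail with probability at most 0.25 in the case that $\|w'\| > 2\sqrt{t}$. Using the union bound
on the two failure probabilities completes the proof. □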



Discussion. We note that the problem of finding a hyperplane that separates all but a small
fraction of the points is the unsupervised analog of the well studied soft margin SVM problem.
The motivation behind the problem, from the perspective of machine learning, is that a hyperplane
that separates most of the data points is still likely to correctly label future points. Hence, if a
hyperplane that separates all of the points cannot be obtained, it suffices to find one that separates
most (e.g. a $1 - \alpha$ fraction) of the data points. The more common setting in which this problem
is presented is when a separating hyperplane does not necessarily exist. In our case, although a
separating hyperplane is guaranteed to exist, it is (provably) computationally hard to obtain it.

5 Hardness of approximation 

The main result of this section is that FHP does not admit a PTAS unless P=NP. That is, obtaining
a $(1 - \epsilon)$-approximation for FHP is NP-hard for some universal constant $\epsilon$. The main idea is
straightforward: reduce from MAX-3SAT, for which such a guarantee is well known, mapping each
clause to a vector. We show that producing a "far" hyperplane from this set of vectors encodes
a good solution for the satisfiability problem. However, FHP is inherently a symmetric problem
(negating a solution does not change its quality) while MAX-3SAT does not share this property.
Thus, we carry out our reduction in two steps: in the first step we reduce MAX-3SAT to a symmetric
satisfaction problem. In the second step we reduce this symmetric satisfaction problem to FHP.
It turns out that in order to show that such a symmetric problem can be geometrically embedded
as an FHP instance, we need the extra condition that each variable appears in at most a constant
number of clauses, and that the number of variables and clauses is comparable to each other. The
reduction process is slightly more involved in order to guarantee this. In the rest of this section we
consider the following satisfaction problem.

Definition 5.1 (SYM formulas). A SYM formula is a CNF formula where each clause has either
2 or 4 literals. Moreover, clauses appear in pairs, where the two clauses in each pair have negated
literals. For example, a pair with 4 literals has the form

$(x_1 \vee x_2 \vee \neg x_3 \vee x_4) \wedge (\neg x_1 \vee \neg x_2 \vee x_3 \vee \neg x_4)$.

We denote by SYM($t$) the class of SYM formulas in which each variable occurs in at most $t$ clauses.

We note that SYM formulas are invariant to negations: if an assignment $x$ satisfies $m$ clauses in a
SYM formula then its negation $\neg x$ will satisfy the same number of clauses.

The first step is to reduce MAX-3SAT to SYM with the additional property that each variable
appears in a constant number of clauses. We denote by MAX-3SAT($t$) the class of MAX-3SAT
formulas where each variable appears in at most $t$ clauses. Theorem 5.2 is the starting point of our
reduction. It asserts that MAX-3SAT(13) is hard to approximate.

Theorem 5.2 (Arora [?], Hardness of approximating MAX-3SAT(13)). Let $\varphi$ be a 3-CNF boolean
formula on $n$ variables and $m$ clauses, where no variable appears in more than 13 clauses. Then
there exists a constant $\gamma > 0$ such that it is NP-hard to distinguish between the following cases:

1. $\varphi$ is satisfiable.

2. No assignment satisfies more than a $(1 - \gamma)$-fraction of the clauses of $\varphi$.

5.1 Reduction from MAX-3SAT(13) to SYM(30) 

The main idea behind the reduction is to add a new global variable to each MAX-3SAT(13) clause 
which will determine whether the assignment should be negated or not, and then to add all negations 
of clauses. The resulting formula is clearly a SYM formula. However, such a global variable will 
appear in too many clauses. We thus "break" it into many local variables (one per clause), and 
impose equality constraints between them. To keep the number of clauses linear
in the number of variables, we only impose equality constraints based on the edges of a constant
degree expander graph. The strong connectivity property of expanders ensures that a maximally 
satisfying assignment to such a formula would assign the same value to all these local variables, 
achieving the same effect of one global variable. 

We now show how to reduce MAX-3SAT to SYM, while maintaining the property that each 
variable occurs in at most a constant number of clauses. 

Theorem 5.3. It is NP-hard to distinguish whether a SYM(30) formula can be satisfied, or whether
all assignments satisfy at most a $1 - \delta$ fraction of the clauses, where $\delta = \gamma/16$ and $\gamma$ is the constant
in Theorem 5.2.



Proof. We describe a gap-preserving reduction from MAX-3SAT(13) to SYM(30). Given an
instance $\varphi$ of MAX-3SAT(13) with $n$ variables $y_1, \ldots, y_n$ and $m$ clauses, construct a SYM formula $\psi$
as follows: each clause $C_i \in \varphi$ is mapped to a pair of clauses $A_i = (C_i \vee \neg z_i)$ and $A_i' = (C_i' \vee z_i)$
where $C_i'$ is the same as $C_i$ with all literals negated and $z_i$ is a new variable associated only with
the $i$-th clause. For example:

$(y_1 \vee \neg y_2 \vee y_3) \longrightarrow (y_1 \vee \neg y_2 \vee y_3 \vee \neg z_i) \wedge (\neg y_1 \vee y_2 \vee \neg y_3 \vee z_i)$.

We denote the resulting set of clauses by $A$. We also add a set of "equality constraints", denoted $B$,
between the variables $z_i$ and $z_j$ as follows. Let $G$ be an explicit $(m, d, r)$-expander with $d = 14$ and
$r > d/5$ (the existence of such constructions is established in Definition 2.3). For each edge $(i,j)$
of the expander, $B$ includes two clauses: $(z_i \vee \neg z_j)$ and $(\neg z_i \vee z_j)$. Let $\psi$ denote the conjunction of
the clauses in $A$ and $B$.
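The construction of $A$ and $B$ can be sketched as follows. The encoding is an assumption made for the example: literals are nonzero integers in the DIMACS style ($+k$ for $y_k$, $-k$ for $\neg y_k$), and the new variable $z_i$ of the $i$-th clause is given the index `n_vars + 1 + i`. The edge list is supplied by the caller; the sketch does not construct the explicit expander that the proof relies on.

```python
def negate(clause):
    """Negate every literal of a clause."""
    return [-lit for lit in clause]


def to_sym(clauses, n_vars, edges):
    """Map 3-CNF clauses to the SYM clause sets A and B of the reduction.

    Literals are nonzero ints: +k is y_k, -k is its negation.  Clause i
    (0-based) gets a fresh variable z_i, numbered n_vars + 1 + i.  `edges`
    is a caller-supplied list of index pairs (i, j); the proof takes them
    from an expander graph, which this sketch does not construct.
    """
    def z(i):
        return n_vars + 1 + i

    part_a = []
    for i, clause in enumerate(clauses):
        part_a.append(list(clause) + [-z(i)])    # A_i  = (C_i  or not z_i)
        part_a.append(negate(clause) + [z(i)])   # A_i' = (C_i' or z_i)
    part_b = []
    for i, j in edges:
        part_b.append([z(i), -z(j)])
        part_b.append([-z(i), z(j)])
    return part_a + part_b


def count_satisfied(clauses, assignment):
    """Number of satisfied clauses; assignment maps variable index -> bool."""
    return sum(any(assignment[abs(l)] == (l > 0) for l in c) for c in clauses)


phi = [[1, -2, 3], [-1, 2, 3]]
psi = to_sym(phi, n_vars=3, edges=[(0, 1)])
satisfying = {1: True, 2: True, 3: True, 4: True, 5: True}   # all z_i set to true
negated = {k: not v for k, v in satisfying.items()}
```

On this toy input the first pair of output clauses matches the example in the proof, and an assignment and its negation satisfy the same number of clauses, as expected of a SYM formula.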

We first note that the above reduction is polynomial time computable; that $\psi$ contains $M =
(d + 2)m = 16m$ clauses; and that every variable of $\psi$ appears in at most $t := \max\{26, 2d + 2\} = 30$
clauses. Therefore, $\psi$ is indeed an instance of SYM(30). To prove the theorem we must show:

• Completeness: If $\varphi$ is satisfiable then so is $\psi$.

• Soundness: If an assignment satisfies a $1 - \delta$ fraction of $\psi$'s clauses then there is an assignment
that satisfies a $1 - \gamma$ fraction of $\varphi$'s clauses.

The completeness is straightforward: given an assignment $y_1, \ldots, y_n$ that satisfies $\varphi$, we can
simply set $z_1, \ldots, z_m$ to true to satisfy $\psi$. For the soundness, suppose that there exists an assignment
which satisfies a $1 - \delta$ fraction of $\psi$'s clauses, and let $v = y_1, \ldots, y_n, z_1, \ldots, z_m$ be a maximally
satisfying assignment.⁴ Clearly, $v$ satisfies at least a $1 - \delta$ fraction of $\psi$'s clauses. We can assume
that at least half of $z_1, \ldots, z_m$ are set to true since otherwise we can negate the solution while
maintaining the number of satisfied clauses.

We first claim that, in fact, all the $z_i$'s must be set to true in $v$. Indeed, let $S = \{i : z_i = \text{false}\}$
and denote $k := |S|$ (recall that $k \le m/2$). Suppose $k > 0$ and let $G$ be the expander graph used
in the reduction. If we change the assignment of all the variables in $S$ to true, we violate at most $k$
clauses from $A$ (as each variable $z_i$ appears in exactly 2 clauses, but one of them is always satisfied).
On the other hand, by definition of $G$, the edge boundary of the set $S$ in $G$ is at least $rk \ge kd/5$,
and every such edge corresponds to a previously violated clause from $B$. Therefore, flipping the
assignment of the variables in $S$ contributes at least $kd/5 - k = \frac{14}{5}k - k > k$ to the number of
satisfied clauses, contradicting the maximality of $v$.

Now, since all the $z_i$'s are set to true, a clause $C_i \in \varphi$ is satisfied iff the clause $A_i \in \psi$ is satisfied.
As the number of unsatisfied clauses among $A_1, \ldots, A_m$ is at most $\delta M = \delta(d + 2)m$ we get that
the number of unsatisfied clauses in $\varphi$ is at most $\delta(d + 2)m = \frac{\gamma}{16} \cdot 16m = \gamma m$, as required. □

5.2 Reduction from SYM to FHP 

We proceed by describing a gap preserving reduction from SYM(t) to FHP. 

Theorem 5.4. Given $\{x^{(i)}\}_{i=1}^n \in \mathbb{R}^d$, it is NP-hard to distinguish whether the furthest hyperplane
has margin $1/\sqrt{d}$ from all points or at most a margin of $(1 - \epsilon)/\sqrt{d}$ for $\epsilon = \Omega(\delta)$, where $\delta$ is the
constant in Theorem 5.3.



Remark 5.5. For convenience and ease of notation we use vectors whose norm is more than 1 but
at most $\sqrt{12}$. The reader should keep in mind that the entire construction should be shrunk by this
factor to facilitate $\|x^{(i)}\|_2 \le 1$. Note that the construction establishes hardness even for the special
case where $n = O(d)$ and for all points $1/\sqrt{12} \le \|x^{(i)}\|_2 \le 1$.

Proof. Let $\psi$ be a SYM($t$) formula with $d$ variables $y_1, \ldots, y_d$ and $m$ clauses $C_1, \ldots, C_m$. We map
each clause $C_i$ to a point $x^{(i)}$ in $\mathbb{R}^d$. Consider first clauses with two variables of the form $(y_{j_1} \vee y_{j_2})$
with $j_1 < j_2$. Let $s_{j_1}, s_{j_2} \in \{-1, 1\}$ denote whether the variables are negated in the clause, where
1 means not negated and $-1$ means negated. Then define the point $x^{(i)}$ as follows: $x^{(i)}_{j_1} = s_{j_1}$;
$x^{(i)}_{j_2} = -s_{j_2}$; and $x^{(i)}_j = 0$ for $j \notin \{j_1, j_2\}$. For example:

$(y_2 \vee y_3) \longrightarrow (0, 1, -1, 0, \ldots, 0)$.

For clauses with four variables $y_{j_1}, \ldots, y_{j_4}$ with $j_1 < \ldots < j_4$ let $s_{j_1}, \ldots, s_{j_4} \in \{-1, 1\}$ denote
whether each variable is negated. Define the point $x^{(i)}$ as follows: $x^{(i)}_{j_1} = 3s_{j_1}$; $x^{(i)}_{j_r} = -s_{j_r}$ for
$r = 2, 3, 4$; and $x^{(i)}_j = 0$ for $j \notin \{j_1, \ldots, j_4\}$. For example:

$(\neg y_1 \vee y_3 \vee y_4 \vee \neg y_6) \longrightarrow (-3, 0, -1, -1, 0, 1, 0, \ldots, 0)$.



⁴ An assignment which satisfies the maximum possible number of clauses from $\psi$.



Finally, we also add the $d$ unit vectors $e_1, \ldots, e_d$ to the set of points (the importance of these
"artificially" added points will become clear later). We thus have a set of $n = m + d$ points. To
establish the correctness of the reduction we must argue the following:

• Completeness: If $\psi$ is satisfiable there exists a unit vector $w$ whose margin is at least $1/\sqrt{d}$.

• Soundness: If there exists a unit vector $w$ whose margin is at least $(1 - \epsilon)/\sqrt{d}$ then there
exists an assignment to variables which satisfies a $1 - \delta$ fraction of $\psi$'s clauses.
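The point construction above (clause points plus the $d$ unit vectors) can be made concrete with a short sketch. The encoding is an assumption made for the example: literals are nonzero integers, $+k$ for $y_k$ and $-k$ for $\neg y_k$, with 1-based variable indices. The helper `assignment_to_w` builds the vector with entries $\pm 1/\sqrt{d}$ that appears in the completeness direction.

```python
import math


def clause_to_point(clause, d):
    """Map a SYM clause (2 or 4 literals) to its point in R^d.

    Literals are nonzero ints: +k is y_k, -k is its negation (1-based).
    """
    lits = sorted(clause, key=abs)            # j_1 < j_2 < ...
    lead = {2: 1, 4: 3}[len(lits)]            # coefficient of the first variable
    x = [0] * d
    for r, lit in enumerate(lits):
        s = 1 if lit > 0 else -1              # s = -1 means negated
        x[abs(lit) - 1] = lead * s if r == 0 else -s
    return x


def fhp_instance(clauses, d):
    """Clause points followed by the d unit vectors e_1, ..., e_d."""
    units = [[1 if k == j else 0 for k in range(d)] for j in range(d)]
    return [clause_to_point(c, d) for c in clauses] + units


def assignment_to_w(assignment, d):
    """w_i = +1/sqrt(d) if y_i is true, -1/sqrt(d) otherwise."""
    return [(1 if assignment[i + 1] else -1) / math.sqrt(d) for i in range(d)]


# A tiny satisfiable SYM formula: (y1 or y2) and (not y1 or not y2).
psi = [[1, 2], [-1, -2]]
instance = fhp_instance(psi, d=2)
w = assignment_to_w({1: True, 2: False}, d=2)
margin = min(abs(sum(wi * xi for wi, xi in zip(w, x))) for x in instance)
```

For this toy formula the satisfying assignment $y_1 = \text{true}$, $y_2 = \text{false}$ yields a unit vector whose margin is $1/\sqrt{2}$, matching the completeness bound $1/\sqrt{d}$ with $d = 2$.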

We first show completeness. Let $y_1, \ldots, y_d$ be an assignment that satisfies $\psi$. Define $w_i = 1/\sqrt{d}$
if $y_i$ is set to true, and $w_i = -1/\sqrt{d}$ if $y_i$ is set to false. This satisfies $\|w\|_2 = 1$. Since the
coordinates of all points $x^{(1)}, \ldots, x^{(n)}$ are integers, to show that the margin of $w$ is at least $1/\sqrt{d}$ it
suffices to show that $\langle w, x^{(i)}\rangle \ne 0$ for all points. This is definitely true for the unit vectors $e_1, \ldots, e_d$.
Consider now a point $x^{(i)}$ which corresponds to a clause $C_i$. We claim that if $\langle w, x^{(i)}\rangle = 0$ then $y$
cannot satisfy both $C_i$ and its negation $C_i'$, which also appears in $\psi$ since it is a symmetric formula.
If $C_i$ has two variables, say $C_i = (y_1 \vee y_2)$, then $x^{(i)} = (1, -1, 0, \ldots, 0)$, and so if $\langle w, x^{(i)}\rangle = 0$ we
must have $w_1 = w_2$ and hence $y_1 = y_2$. This violates one of $C_i = y_1 \vee y_2$ and $C_i' = \neg y_1 \vee \neg y_2$.
If $C_i$ has four variables, say $C_i = y_1 \vee y_2 \vee y_3 \vee y_4$, then $x^{(i)} = (3, -1, -1, -1, 0, \ldots, 0)$, and so if
$\langle w, x^{(i)}\rangle = 0$ then either $w = (1/\sqrt{d})(1, 1, 1, 1, \ldots)$ or $w = (1/\sqrt{d})(-1, -1, -1, -1, \ldots)$. That is,
$y_1 = y_2 = y_3 = y_4$, which violates one of $C_i$ and $C_i'$. The same argument follows if some
variables are negated.

We now turn to prove soundness. Assume there exists a unit vector $w \in \mathbb{R}^d$ such that
$|\langle w, x^{(i)}\rangle| \ge (1 - \epsilon)/\sqrt{d}$ for all $i$. Define an assignment $y_1, \ldots, y_d$ as follows: if $w_i > 0$ set $y_i = $ true,
otherwise set $y_i = $ false. If we had that all $|w_i| \approx 1/\sqrt{d}$ then this assignment would have satisfied
all clauses of $\psi$. This does not have to be the case, but we will show that it is so for most $w_i$.
Call $w_i$ whose absolute value is close to $1/\sqrt{d}$ "good", and ones which deviate from $1/\sqrt{d}$ "bad".
We will show that each clause which contains only good variables must be satisfied. Since each
bad variable appears only in a constant number of clauses, showing that there are not many bad
variables would imply that most clauses of $\psi$ are satisfied.

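Claim 5.6. Let $B = \{i : \big||w_i| - 1/\sqrt{d}\big| > 0.1/\sqrt{d}\}$ be the set of "bad" variables. Then $|B| \le 10\epsilon d$.

Proof. For all $i$ we have $|w_i| \ge (1 - \epsilon)/\sqrt{d}$ since the unit vectors $e_1, \ldots, e_d$ are included in the point
set. Thus if $i \in B$ then $|w_i| \ge 1.1/\sqrt{d}$. Since $w$ is a unit vector we have

$$1 = \sum_{i \in B} w_i^2 + \sum_{i \notin B} w_i^2 \ge |B| \cdot \frac{1.21}{d} + (d - |B|) \cdot \frac{(1-\epsilon)^2}{d},$$

which after rearranging gives $|B| \le d \cdot \frac{1 - (1-\epsilon)^2}{1.21 - (1-\epsilon)^2} \le 10\epsilon d$. □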

Claim 5.7. Let $C_i$ be a clause which does not contain any variable from $B$. Then the assignment
$y_1, \ldots, y_d$ satisfies $C_i$.

Proof. Assume by contradiction that $C_i$ is not satisfied. Let $x^{(i)}$ be the point which corresponds
to $C_i$. We show that $|\langle w, x^{(i)}\rangle| < (1-\varepsilon)/\sqrt{d}$, which contradicts our assumption on $w$.

Consider first the case that $C_i$ contains two variables, say $C_i = (y_1 \vee y_2)$, which gives $x^{(i)} =
(1, -1, 0, \ldots, 0)$. Since $C_i$ is not satisfied we have $y_1 = y_2 = $ false, hence $w_1, w_2 \in (-1/\sqrt{d} \pm \eta)$
where $\eta \le 0.1/\sqrt{d}$, which implies that $|\langle w, x^{(i)}\rangle| \le 0.2/\sqrt{d} < (1-\varepsilon)/\sqrt{d}$. Similarly, suppose $C_i$
contains four variables, say $C_i = (y_1 \vee y_2 \vee y_3 \vee y_4)$, which gives $x^{(i)} = (3, -1, -1, -1, 0, \ldots, 0)$.
Since $C_i$ is not satisfied we have $y_1 = y_2 = y_3 = y_4 = $ false, hence $w_1, w_2, w_3, w_4 \in (-1/\sqrt{d} \pm \eta)$
where $\eta \le 0.1/\sqrt{d}$, which implies that $|\langle w, x^{(i)}\rangle| \le 0.6/\sqrt{d} < (1-\varepsilon)/\sqrt{d}$. The other cases where
some variables are negated are proved in the same manner. □
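
The two numerical bounds used in the proof of Claim 5.7 can be checked mechanically. The sketch below is only an illustration under stated assumptions: it handles the un-negated clauses $(1,-1)$ and $(3,-1,-1,-1)$, takes a hypothetical dimension $d = 100$, and uses a helper name (`max_abs_inner_product`) that does not appear in the paper. Since $\langle w, x\rangle$ is linear in $w$, its extreme values over the box $w_j \in [-1/\sqrt{d} - \eta, -1/\sqrt{d} + \eta]$ are attained at corners, so enumerating corners is enough.

```python
import itertools
import math


def max_abs_inner_product(x, center, eta):
    """Largest |<w, x>| when every w_j on the support of x lies in [center - eta, center + eta]."""
    best = 0.0
    # A linear function on a box attains its extremes at the corners of the box.
    for signs in itertools.product((-1.0, 1.0), repeat=len(x)):
        value = sum(c * (center + s * eta) for c, s in zip(x, signs))
        best = max(best, abs(value))
    return best


d = 100                          # hypothetical dimension
center = -1.0 / math.sqrt(d)     # all variables false: w_j is close to -1/sqrt(d)
eta = 0.1 / math.sqrt(d)         # "good" coordinates deviate by at most 0.1/sqrt(d)

two = max_abs_inner_product((1, -1), center, eta)
four = max_abs_inner_product((3, -1, -1, -1), center, eta)
print(two * math.sqrt(d), four * math.sqrt(d))
```

The printed values are the worst cases in units of $1/\sqrt{d}$; both stay below $1 - \varepsilon$ for any $\varepsilon < 0.4$.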



We now conclude the proof of Theorem 5.4. We have $|B| \le 10\varepsilon d$. Since any variable occurs
in at most $t$ clauses, there are at most $10\varepsilon d t$ clauses which contain a "bad" variable. As all other
clauses are satisfied, the fraction of clauses that the assignment to $y_1, \ldots, y_d$ does not satisfy is at
most $10\varepsilon d t/m \le 10\varepsilon t \le \delta$ for $\varepsilon = 0.1(\delta/t) = \Omega(\delta)$, since $t = 30$ in Theorem 5.3. □



6 Discussion 

A question which is not resolved in this paper is whether there exists an efficient constant factor
approximation algorithm for the margin of FHP which applies to all points in the input, rather than
to a fraction of them. The authors have considered several techniques to try to rule out an $O(1)$
approximation for the problem, for example amplifying the gap of the reduction of the previous
section. This, however, did not succeed.
Even so, the resemblance of FHP to some hard algebraic problems admitting no constant factor
approximation leads the authors to believe that the problem is indeed inapproximable to within a
constant factor.
A Proof of Lemma 3.1



Lemma. Let $w^*$ and $y^*$ denote the optimal solution of margin $\theta$ and the labeling it induces.
Let $y$ be the labeling induced by a random hyperplane $w$. The probability that $y = y^*$ is at least
$n^{-O(\theta^{-2})}$.



Proof. Let $c_1, c_2$ be some sufficiently large constants whose exact values will be determined later.
For technical reasons, assume w.l.o.g. that $d \ge c_1 \log(n)\,\theta^{-2}$ (see footnote 5). Denote by $E$ the event that

$$\langle w^*, w\rangle \ge \sqrt{c_2 \log(n)}\;\theta^{-1} \cdot \frac{1}{\sqrt{d}}.$$

The following lemma gives an estimate for the probability of E. Although its proof is quite standard, 
we give it for completeness. 

Lemma A.1. Let $w$ be a uniformly random unit vector in $\mathbb{R}^d$. There exists some universal constant
$c_3$ such that for any $1 \le h \le c_3\sqrt{d}$ and any fixed unit vector $w^*$ it holds that

$$\Pr[\langle w, w^*\rangle > h/\sqrt{d}] = 2^{-\Theta(h^2)}.$$

As an immediate corollary we get that by setting appropriate values for $c_1, c_2, c_3$ we guarantee
that $\Pr[E] \ge n^{-O(\theta^{-2})}$.



Proof of Lemma A.1. Notice that $\Pr[\langle w, w^*\rangle > h/\sqrt{d}]$ is exactly the ratio between the surface
area of a spherical cap defined by the direction $w^*$ and height (i.e., distance from the origin) $h/\sqrt{d}$
and the surface area of the entire unit sphere. To estimate the probability we bound this ratio
from below and from above.

Define $S_d$, $C_{d,h}$ as the surface areas of the $d$ dimensional unit sphere and the $d$ dimensional spherical
cap of height $h/\sqrt{d}$ correspondingly. Denote by $S_{d-1,r}$ the surface area of a $d-1$ dimensional
sphere with radius $r$. Then,

$$\frac{C_{d,h}}{S_d} = \int_{H=h/\sqrt{d}}^{1} \frac{S_{d-1,\sqrt{1-H^2}}}{S_d}\, dH.$$



We compute the ratio $S_{d-1,\sqrt{1-H^2}}/S_d$ with the well known formula for the surface area of a sphere of
radius $r$ and dimension $d$, namely $2\pi^{d/2} r^{d-1}/\Gamma(d/2)$, where $\Gamma$ is the so called Gamma function, for which
$\Gamma(d/2) = (d/2 - 1)!$ when $d$ is even and $\Gamma(d/2) = \sqrt{\pi}\,\frac{(d-2)!!}{2^{(d-1)/2}}$ when $d$ is odd. We get that for any
$H \le 1/2$,

$$\frac{S_{d-1,\sqrt{1-H^2}}}{S_d} = \Omega\left(\sqrt{d} \cdot (1 - H^2)^{(d-2)/2}\right) = \Omega\left(\sqrt{d} \cdot e^{-dH^2/2}\right)$$

and that for any $H \le 1$,

$$\frac{S_{d-1,\sqrt{1-H^2}}}{S_d} = O\left(\sqrt{d} \cdot (1 - H^2)^{(d-2)/2}\right) = O\left(\sqrt{d} \cdot e^{-dH^2/2}\right).$$
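The lower bound is given in the following equation.

$$\begin{aligned}
\Pr\left[\langle w, w^*\rangle > h/\sqrt{d}\right] &= \frac{C_{d,h}}{S_d} = \int_{H=h/\sqrt{d}}^{1} \frac{S_{d-1,\sqrt{1-H^2}}}{S_d}\, dH \ge \int_{H=h/\sqrt{d}}^{2h/\sqrt{d}} \frac{S_{d-1,\sqrt{1-H^2}}}{S_d}\, dH \\
&\overset{(*)}{=} \Omega\left(\int_{H=h/\sqrt{d}}^{2h/\sqrt{d}} \sqrt{d} \cdot e^{-dH^2/2}\, dH\right) = \Omega\left(\int_{h'=h}^{2h} e^{-h'^2/2}\, dh'\right) = \Omega\left(h \cdot e^{-2h^2}\right) = 2^{-O(h^2)}.
\end{aligned}$$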

5 If that is not the case to begin with, we can simply embed the vectors in a space of higher dimension.
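Equation (*) holds since $2h/\sqrt{d} \le 1/2$. The upper bound is due to the following.

$$\begin{aligned}
\Pr\left[\langle w, w^*\rangle > h/\sqrt{d}\right] &= \int_{H=h/\sqrt{d}}^{1} \frac{S_{d-1,\sqrt{1-H^2}}}{S_d}\, dH = O\left(\int_{H=h/\sqrt{d}}^{1} \sqrt{d} \cdot e^{-dH^2/2}\, dH\right) \\
&= O\left(\int_{h'=h}^{\infty} e^{-h'^2/2}\, dh'\right) \overset{(**)}{=} O\left(\int_{h'=h}^{\infty} e^{h^2/2 - hh'}\, dh'\right) = 2^{-\Omega(h^2)}.
\end{aligned}$$

In equation (**) we used the fact that $-h^2/2 + hh' \le h'^2/2$ for all $h' \ge h$. The last equation holds
since $h \ge 1$.

□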
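
The statement of Lemma A.1 can also be explored numerically. The following Python sketch is a Monte Carlo illustration, not part of the proof: it samples uniformly random unit vectors by normalizing Gaussian vectors and estimates $\Pr[\langle w, w^*\rangle > h/\sqrt{d}]$ with $w^* = e_1$ (which is without loss of generality by rotational symmetry). The function name `cap_probability`, the dimension $d = 50$, the trial count and the seed are all arbitrary choices made for this example.

```python
import math
import random


def cap_probability(d, h, trials, seed=0):
    """Monte Carlo estimate of Pr[<w, e_1> > h / sqrt(d)] for a uniform unit vector w."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(v * v for v in g))
        # g / norm is uniform on the unit sphere; its first coordinate is <w, e_1>.
        if g[0] / norm > h / math.sqrt(d):
            hits += 1
    return hits / trials


p1 = cap_probability(d=50, h=1.0, trials=5000)
p2 = cap_probability(d=50, h=2.0, trials=5000)
print(f"h=1: {p1:.4f}   h=2: {p2:.4f}")
```

The estimates should decay roughly like a Gaussian tail in $h$, in line with the $2^{-\Theta(h^2)}$ behaviour the lemma asserts.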



We continue with the proof of Lemma 3.1. We now analyze the success probability given the
event $E$ has occurred. For the analysis, we rotate the vector space so that $w^* = (1, 0, 0, \ldots, 0)$. A
vector $x$ can now be viewed as $x = (x_1, \tilde{x})$ where $x_1 = \langle w^*, x\rangle$ and $\tilde{x}$ is the $d-1$ dimensional vector
corresponding to the projection of $x$ onto the hyperplane orthogonal to $w^*$. Since $w$ is chosen as a
random unit vector, we know that given the mentioned event $E$, it can be viewed as $w = (w_1, \tilde{w})$
where $\tilde{w}$ is a uniformly chosen vector from the $d-1$ dimensional sphere of radius $\sqrt{1 - w_1^2}$ and

$$w_1 \ge \sqrt{c \log(n)}\;\theta^{-1} \cdot \frac{1}{\sqrt{d}},$$

where $c$ denotes the constant $c_2$ from the definition of $E$.

Consider a vector $x \in \mathbb{R}^d$ where $\|x\| \le 1$ such that $\langle w^*, x\rangle \ge \theta$. As before we write $x = (x_1, \tilde{x})$
where $\|\tilde{x}\| \le \sqrt{1 - x_1^2}$. Then

$$\langle x, w\rangle = x_1 w_1 + \langle \tilde{x}, \tilde{w}\rangle \ge \frac{\sqrt{c \log n}}{\sqrt{d}} + \langle \tilde{x}, \tilde{w}\rangle.$$

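Notice that both $\tilde{x}, \tilde{w}$ are vectors whose norms are at most 1 and the direction of $\tilde{w}$ is chosen
uniformly at random, and is independent of $E$. Hence, according to Lemma A.1,

$$\Pr_{\tilde{w}}\left[\,|\langle \tilde{x}, \tilde{w}\rangle| \ge \sqrt{c \log n}/\sqrt{d}\,\right] \le n^{-\Omega(c)}.$$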



It follows that the sign of $\langle w, x\rangle$ is positive with probability $1 - n^{-\Omega(c)}$. By symmetry we get
an analogous result for a vector $x$ s.t. $\langle w^*, x\rangle \le -\theta$. By union bound we get that for sufficiently
large $c$, with probability at least $1/2$ we get that for all $i \in [n]$, $\mathrm{sign}\langle w, x^{(i)}\rangle = \mathrm{sign}\langle w^*, x^{(i)}\rangle$ (given the
event $E$ has occurred) as required. To conclude,

$$\Pr[y = y^*] \ge \Pr[E] \cdot \Pr[y = y^* \mid E] \ge n^{-O(\theta^{-2})}.$$

□
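
To make the statement of the lemma concrete, the sketch below estimates empirically how often a random hyperplane reproduces the labeling of a planted hyperplane. It is an illustration under assumptions that are not in the paper: the instance is planted with $w^* = e_1$ and every point has first coordinate exactly $\pm\theta$, the parameters $n = 8$, $d = 5$, $\theta = 0.6$ are arbitrary, and all function names are hypothetical. The planted hyperplane has margin $\theta$ but is not claimed to be optimal.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniformly random point on the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def planted_instance(n, d, theta, rng):
    """Unit-norm points with first coordinate +theta or -theta, so w* = e_1 has margin theta."""
    points = []
    for _ in range(n):
        x1 = theta if rng.random() < 0.5 else -theta
        rest = random_unit_vector(d - 1, rng)
        scale = math.sqrt(1.0 - x1 * x1)
        points.append([x1] + [scale * v for v in rest])
    return points


def agreement_frequency(points, trials, rng):
    """Fraction of random hyperplanes whose labeling equals the one induced by w* = e_1."""
    d = len(points[0])
    hits = 0
    for _ in range(trials):
        w = random_unit_vector(d, rng)
        if all((sum(a * b for a, b in zip(w, x)) > 0) == (x[0] > 0) for x in points):
            hits += 1
    return hits / trials


rng = random.Random(1)
pts = planted_instance(n=8, d=5, theta=0.6, rng=rng)
freq = agreement_frequency(pts, trials=2000, rng=rng)
print(f"empirical Pr[y = y*] = {freq:.3f}")
```

Lowering `theta` or raising `n` in this sketch should make agreement rarer, which is the qualitative dependence that the bound $n^{-O(\theta^{-2})}$ describes.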



B An SDP relaxation for FHP 

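The authors have considered the following SDP relaxation for the furthest hyperplane problem:

$$\begin{aligned}
\text{Maximize} \quad & \|z\|^2 \\
\text{s.t.} \quad & \forall\, 1 \le i \le n: \ \Big\|\sum_{j=1}^{d} x^{(i)}_j W^j\Big\|^2 \ge \|z\|^2 \\
& \sum_{j=1}^{d} \|W^j\|^2 = 1
\end{aligned} \qquad (2)$$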

It is easy to see that the above semidefinite program is indeed a relaxation to FHP (given an
optimal solution $w$ to FHP, simply set the first coordinate of $W^j$ to $w_j$ and the rest to zero, for all
$j$. This is a feasible solution to the SDP which achieves value $\theta^2$). Nevertheless, this SDP has an
integrality gap $\Omega(n)$. To see this, observe that regardless of the input points, the SDP may always
"cheat" by choosing the vectors $W^j$ to be orthogonal to one another such that $W^j = \frac{1}{\sqrt{d}} e_j$, yielding
a solution of value $\|z\|^2 = 1/d$. However, if $d = 2$ and the $x^{(i)}$'s are $n$ equally spaced points on
the unit circle of $\mathbb{R}^2$, then no integral solution has value better than $O(1/n)$ (see footnote 6). The question of
whether this relaxation can be strengthened by adding convex linear constraints satisfied by the
integral solution is beyond the reach of this paper.
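
The integrality gap example can be reproduced numerically. The sketch below is an illustration only, with hypothetical function names and an arbitrary choice of $n = 9$: it evaluates the "cheating" SDP solution $W^j = e_j/\sqrt{d}$ on $n$ equally spaced points of the unit circle and compares it with the best margin of an actual hyperplane (a line through the origin), which is found by a plain grid search over the angle of the normal vector rather than by any method from the paper.

```python
import math


def circle_points(n):
    """n equally spaced points on the unit circle of R^2."""
    return [(math.cos(2 * math.pi * k / n), math.sin(2 * math.pi * k / n)) for k in range(n)]


def best_margin_on_circle(points, steps=20000):
    """Grid search over unit normals w = (cos a, sin a), a in [0, pi), for the largest margin."""
    best = 0.0
    for s in range(steps):
        a = math.pi * s / steps
        w = (math.cos(a), math.sin(a))
        margin = min(abs(w[0] * x + w[1] * y) for x, y in points)
        best = max(best, margin)
    return best


def cheating_sdp_value(points):
    """Value min_i ||sum_j x_j^(i) W^j||^2 of the SDP solution W^j = e_j / sqrt(d)."""
    d = len(points[0])
    # With orthogonal W^j the squared norm is the sum of (x_j / sqrt(d))^2 over coordinates j.
    return min(sum((c / math.sqrt(d)) ** 2 for c in x) for x in points)


n = 9
pts = circle_points(n)
sdp_value = cheating_sdp_value(pts)
margin = best_margin_on_circle(pts)
print(f"SDP value = {sdp_value:.4f}, best integral margin = {margin:.4f}")
```

The SDP value stays at $1/d = 1/2$ for every $n$, while the best margin found by the search shrinks as $n$ grows.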

C A note on average case complexity of FHP 

Given the hardness results above, a natural question is whether random instances of FHP are
easier to solve. As our algorithmic results suggest, the answer to this question highly depends
on the maximal separation margin of such instances. We consider a natural model in which the
points $\{x^{(i)}\}_{i=1}^{n}$ are drawn isotropically and independently at random close to the unit sphere $S^{d-1}$.
More formally, each coordinate of each point is drawn independently at random from a Normal
distribution with standard deviation $1/\sqrt{d}$: $x^{(i)}_j \sim \mathcal{N}(0, 1/\sqrt{d})$.

Let us denote by $\theta_{rand}$ the maximal separation margin of the set of points $\{x^{(i)}\}_{i=1}^{n}$. While computing
the exact value of $\theta_{rand}$ is beyond the reach of this paper (see footnote 7), we prove the following simple
bounds on it:



Theorem C.1. With probability at least 2/3,

$$\Omega\left(\frac{1}{n\sqrt{d}}\right) = \theta_{rand} = O\left(\frac{1}{\sqrt{d}}\right).$$

Proof. For the upper bound, let $w$ be the normal vector of the furthest hyperplane achieving margin
$\theta_{rand}$, and let $y_i \in \{\pm 1\}$ be the sides of the hyperplane to which the points $x^{(i)}$ belong, i.e., for all
$1 \le i \le n$ we have $y_i \langle w, x^{(i)}\rangle \ge \theta_{rand}$. Summing both sides over all $i$ and using linearity of inner
products we get

$$\Big\langle w, \sum_{i=1}^{n} y_i \cdot x^{(i)} \Big\rangle \ge \theta_{rand} \cdot n. \qquad (3)$$

By Cauchy-Schwarz and the fact that $\|w\| = 1$ we have that the LHS of (3) is at most $\|\sum_{i=1}^{n} y_i \cdot
x^{(i)}\| = \|Xy\|$. Here $X$ denotes the $d \times n$ matrix whose $i$'th column is $x^{(i)}$, and $y$ the $\{\pm 1\}^n$
vector whose $i$'th entry is $y_i$. Hence

$$\theta_{rand} \cdot n \le \|Xy\| \le \|y\| \cdot \|X\| \le \sqrt{n} \cdot O\left(\frac{\sqrt{n} + \sqrt{d}}{\sqrt{d}}\right) = O\left(\frac{n}{\sqrt{d}}\right) \qquad (4)$$

where the second inequality follows from the definition of the spectral norm, and the third inequality follows
from the facts that the spectral norm of a $d \times n$ matrix whose entries are $\mathcal{N}(0, 1)$ distributed is
$O(\sqrt{n} + \sqrt{d})$ w.h.p. (see [18]) and the fact that $\|y\| = \sqrt{n}$; the last equality uses $n \ge d$. Rearranging (4) yields the desired
upper bound.
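
The first two steps of the upper bound hold for any unit vector $w$ together with the labeling it induces, not only for the optimal one, which makes them easy to check on a sampled instance. The sketch below is illustrative: the function names, the sizes $n = 30$, $d = 6$ and the seed are assumptions, and the test vector $w$ is random rather than optimal. It verifies that the margin of $w$ is at most $\|Xy\|/n$.

```python
import math
import random


def random_instance(n, d, rng):
    """n points in R^d whose coordinates are i.i.d. Gaussian with standard deviation 1/sqrt(d)."""
    sigma = 1.0 / math.sqrt(d)
    return [[rng.gauss(0.0, sigma) for _ in range(d)] for _ in range(n)]


def margin_and_upper_bound(points, w):
    """Return (min_i |<w, x_i>|, ||X y|| / n) where y_i = sign(<w, x_i>)."""
    n, d = len(points), len(w)
    inner = [sum(wj * xj for wj, xj in zip(w, x)) for x in points]
    margin = min(abs(v) for v in inner)
    y = [1.0 if v >= 0 else -1.0 for v in inner]
    # X y is the signed sum of the points, computed coordinate by coordinate.
    Xy = [sum(y[i] * points[i][j] for i in range(n)) for j in range(d)]
    return margin, math.sqrt(sum(v * v for v in Xy)) / n


rng = random.Random(0)
n, d = 30, 6
points = random_instance(n, d, rng)
g = [rng.gauss(0.0, 1.0) for _ in range(d)]
g_norm = math.sqrt(sum(v * v for v in g))
w = [v / g_norm for v in g]  # an arbitrary unit normal vector

margin, bound = margin_and_upper_bound(points, w)
print(f"margin = {margin:.4f} <= ||Xy||/n = {bound:.4f}")
```

The inequality follows because the minimum of the values $|\langle w, x^{(i)}\rangle|$ is at most their average, and that average equals $\langle w, Xy\rangle/n \le \|Xy\|/n$.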

6 In the same spirit, for general $d$, consider the instance whose $n$ points are all the points of an $\varepsilon$-net on the unit
sphere for $\varepsilon = O(n^{-1/d})$. This instance has margin $O(\varepsilon) = O(n^{-1/d})$, and therefore $\|z\|^2 = O(n^{-2/d})$, yielding an
integrality gap of $\Omega(n^{2/d}/d)$.

7 The underlying probabilistic question to be answered is: what is the probability that $n$ random points on $S^{d-1}$
all fall into a cone of measure $\theta$?

For the lower bound, consider a random hyperplane defined by the normal vector $w = w'/\|w'\|$
where the entries of $w'$ distribute i.i.d. $\frac{1}{\sqrt{d}}\mathcal{N}(0, 1)$. From the rotational invariance of the Gaussian
distribution we have that $\langle w', x^{(i)}\rangle$ also distributes $\frac{1}{\sqrt{d}}\mathcal{N}(0, 1)$. Using the fact that w.h.p. $\|w'\| \le 2$,
we have for any $c > 1$:



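$$\Pr\left[\,|\langle w, x^{(i)}\rangle| \le \frac{1}{c \cdot n \sqrt{d}}\,\right] \le \Pr\left[\,|\langle w', x^{(i)}\rangle| \le \frac{2}{c \cdot n \sqrt{d}}\,\right] = \Pr_{Z \sim \mathcal{N}(0,1)}\left[\,|Z| \le \frac{2}{c \cdot n}\,\right] = O\left(\frac{1}{c \cdot n}\right) \qquad (5)$$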



For a sufficiently large constant $c$, a simple union bound implies that the probability that there
exists a point $x^{(i)}$ which is closer than $1/(c \cdot n\sqrt{d})$ to the hyperplane defined by $w$ is at most $1/3$.
Note that the analysis of the lower bound does not change even if the points are arbitrarily spread
on the unit sphere (since the normal distribution is spherically symmetric). Therefore, choosing a
random hyperplane also provides a trivial $O(n\sqrt{d})$ worst case approximation for FHP. □
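
The lower bound argument can be illustrated by simulation. The sketch below fixes $n$ points on the unit sphere, repeatedly draws a random hyperplane, and records how often its margin is at least $1/(c \cdot n\sqrt{d})$. Everything specific here is an assumption made for the example: the function names, the sizes $n = 20$ and $d = 10$, the constant $c = 10$, the number of trials and the seed.

```python
import math
import random


def random_unit_vector(d, rng):
    """Uniformly random point on the unit sphere in R^d (normalized Gaussian)."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def random_hyperplane_margin(points, rng):
    """Margin min_i |<w, x_i>| achieved by one random unit normal vector w."""
    w = random_unit_vector(len(points[0]), rng)
    return min(abs(sum(a * b for a, b in zip(w, x))) for x in points)


def success_rate(n, d, c, trials, seed=0):
    """Fraction of random hyperplanes with margin at least 1 / (c * n * sqrt(d))."""
    rng = random.Random(seed)
    points = [random_unit_vector(d, rng) for _ in range(n)]  # fixed instance on the sphere
    threshold = 1.0 / (c * n * math.sqrt(d))
    wins = sum(random_hyperplane_margin(points, rng) >= threshold for _ in range(trials))
    return wins / trials


rate = success_rate(n=20, d=10, c=10.0, trials=300)
print(f"fraction of random hyperplanes with margin >= 1/(c n sqrt(d)): {rate:.3f}")
```

With a constant of this size the observed fraction should sit comfortably above the $2/3$ required by Theorem C.1.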